Decision Trees (DT)

Geometric Intuition of Decision Trees: Axis-Parallel Hyperplanes

A Decision Tree is a flowchart-like structure that recursively splits data based on feature values. Each split creates decision boundaries that are parallel to the feature axes, which makes decision trees highly interpretable. Below, we explore how a decision tree models the given dataset.


Dataset Used

Outlook Temp. Humidity Windy Play Golf
Rainy Hot High False No
Rainy Hot High True No
Overcast Hot High False Yes
Sunny Mild High False Yes
Sunny Cool Normal False Yes
Sunny Cool Normal True No
Overcast Cool Normal True Yes
Rainy Mild High False No
Rainy Cool Normal False Yes
Sunny Mild Normal False Yes
Rainy Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Sunny Mild High True No

This dataset contains four predictor variables (Outlook, Temperature, Humidity, and Windy) and a target variable (Play Golf), which the decision tree uses to predict whether golf can be played.


How the Decision Tree Works on This Dataset

  1. Step 1: Splitting on Outlook (First Decision Boundary)
    • The tree first checks Outlook because it is the most informative feature.
    • This results in three distinct groups in the feature space:
      • If Outlook = Overcast → Always "Yes" (no further splitting needed).
      • If Outlook = Sunny → Further splits occur based on Windy.
      • If Outlook = Rainy → Further splits occur based on Humidity.
  2. Step 2: Further Splitting on Other Features
    • For Sunny Days:
      • If Windy = False → "Yes" (can play golf).
      • If Windy = True → "No" (not suitable).
    • For Rainy Days:
      • If Humidity = High → "No" (too humid to play).
      • If Humidity = Normal → "Yes" (good conditions).
  3. Final Decision Boundaries
    • The dataset is now divided into smaller rectangular decision regions, each corresponding to a final classification (Yes/No).
    • Each leaf node in the decision tree represents a final classification for a set of feature values.

Geometric Intuition: How the Tree Creates Decision Boundaries

  • The decision tree divides the feature space into axis-parallel regions.
  • Each split corresponds to a vertical or horizontal boundary in the space of selected features.
  • The resulting decision regions are rectangular and aligned with feature axes.
  • This structure differs from models such as logistic regression, whose linear decision boundaries need not be axis-aligned and can be oblique (diagonal).

Example: Decision Boundaries for Outlook & Humidity

If we visualize a 2D feature space with Outlook on the x-axis and Humidity on the y-axis:

  • The first split at Outlook = Overcast creates a vertical boundary separating "Yes" from the rest.
  • The second split at Humidity = High for Rainy days creates a horizontal boundary in the Rainy region.
  • Further splits (like Windy for Sunny days) refine the classification into more rectangular segments.

Key Takeaways

  • Decision Trees create axis-aligned decision boundaries (rectangular regions).
  • Each split represents an IF-ELSE condition, forming nested rules.
  • Deeper trees create more refined partitions, improving accuracy but risking overfitting.

This makes Decision Trees intuitive and interpretable, but they may struggle with complex, non-rectangular patterns in the data.
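To see the axis-parallel nature of these splits concretely, here is a minimal sketch, assuming scikit-learn and NumPy are installed; it uses a small synthetic two-feature numeric dataset (not the golf data) and prints the fitted tree's rules, each of which is a single-feature threshold, i.e., a vertical or horizontal boundary.

```python
# Minimal sketch: a decision tree on two numeric features only ever splits on
# one feature at a time, producing axis-parallel (rectangular) regions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))              # two numeric features x0, x1
y = ((X[:, 0] > 5) & (X[:, 1] > 3)).astype(int)    # target is a rectangular region

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Every printed rule has the form "feature <= threshold": a vertical or
# horizontal boundary in the (x0, x1) plane.
print(export_text(tree, feature_names=["x0", "x1"]))
```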

Explain Entropy?

Entropy

  • Entropy is a measure of unpredictability (impurity). A random variable with higher entropy is more unpredictable than one with lower entropy.
  • Suppose we toss a coin 4 times and observe the following outcomes:
Outcome P(H) P(T) Entropy Interpretation
{H, H, T, T} 50% 50% 1 The outcome is maximally random (fair coin)
{T, T, T, H} 25% 75% 0.8113 The outcome is less random, i.e., a biased coin
  • For a random variable Y that can take K values, the entropy of Y is:

\[ H(Y) = -\sum_{i=1}^{K} P(y_i) \log_b P(y_i), \quad \text{where } P(y_i) = P(Y = y_i) \]

Typically, b = 2 (log₂, in bits) or b = e (natural log, in nats).

  • For example, the Play Golf dataset above has two classes (Yes and No):

\[ H(y) = -P(y = \text{yes}) \log_2 P(y = \text{yes}) - P(y = \text{no}) \log_2 P(y = \text{no}) \]

\[ H(y) = -\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 0.94 \]
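As a quick check of these values, here is a minimal sketch of the entropy computation in plain Python (base-2 logarithm; the helper name `entropy` is just illustrative).

```python
# Minimal sketch of the base-2 entropy calculation used above.
from math import log2

def entropy(counts):
    """H(Y) = -sum_i p_i * log2(p_i), computed from a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([2, 2]))   # fair coin {H, H, T, T}   -> 1.0
print(entropy([1, 3]))   # biased coin {T, T, T, H} -> 0.8113
print(entropy([9, 5]))   # Play Golf (9 Yes, 5 No)  -> 0.9403
```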

Comparing entropies for different cases for 2-class example

Entropy graph for 2-class example
Sr. No. P(Y+) P(Y-) Entropy
1 99% 1% 0.08
2 50% 50% 1.00
3 100% 0% 0.00

If both classes are equally probable, we have a maximum entropy value of 1.

If one class fully dominates, the entropy value becomes 0.


Comparing entropies for different cases for Multi-class example

  • If all classes are equiprobable (a uniform distribution), the entropy is maximum.
Uniform distribution entropy graph
  • If one class has 100% probability and others are zero, the entropy is minimum.
  • In this figure, a more peaked (concentrated) distribution has lower entropy than a flatter, more spread-out one: the more peaked the distribution, the lower the entropy.

Kullback-Leibler Divergence or relative entropy

KL Divergence Graph
  • KL Divergence measures the difference between two probability distributions over the same variable X.
  • For discrete probability distributions P and Q defined on the same probability space X:

\( D_{KL}(P || Q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} \)

  • For distributions P and Q of a continuous random variable

\( D_{KL}(P || Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \,dx \)

where p(x) and q(x) are the probability mass functions (discrete case) or probability densities (continuous case) of P and Q.
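For the discrete case, a minimal sketch in plain Python (base-2 logarithm; the two distributions below are made-up examples and the helper name `kl_divergence` is illustrative):

```python
# Minimal sketch of KL divergence between two discrete distributions P and Q
# defined over the same outcomes.
from math import log2

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x p(x) * log2(p(x) / q(x)).
    Assumes both inputs sum to 1 and q(x) > 0 wherever p(x) > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # hypothetical distribution P
q = [0.4, 0.4, 0.2]   # hypothetical distribution Q
print(kl_divergence(p, q))   # > 0; KL divergence is 0 only when P == Q
print(kl_divergence(p, p))   # 0.0
```

Note that KL divergence is not symmetric: D_KL(P || Q) is generally different from D_KL(Q || P).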

Explain Information Gain

Information Gain

The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).

Example:

  • As computed earlier, the entropy of the Play Golf target variable is 0.94:

\[ H(y) = -P(y = \text{yes}) \log P(y = \text{yes}) - P(y = \text{no}) \log P(y = \text{no}) \]

\[ H(y) = - \left( \frac{9}{14} \log \frac{9}{14} + \frac{5}{14} \log \frac{5}{14} \right) = 0.94 \]

The dataset is then split on each attribute in turn. The entropy of each branch is calculated and weighted by the fraction of samples in that branch to obtain the total (weighted) entropy of the split. Subtracting this from the entropy before the split gives the Information Gain, i.e., the decrease in entropy.

Outlook vs Play Golf

Outlook Yes No Entropy W.E(Play, Outlook)
Sunny 3 2 0.9710 0.3468
Overcast 4 0 0.0000 0.0000
Rainy 2 3 0.9710 0.3468
Gain 0.247

Humidity vs Play Golf

Humidity Yes No Entropy W.E(Play, Humidity)
High 3 4 0.9852 0.4926
Normal 6 1 0.5917 0.2958
Gain 0.152

Temperature vs Play Golf

Temp Yes No Entropy W.E(Play, Temp)
Hot 2 2 1.0000 0.2857
Mild 4 2 0.9183 0.3936
Cool 3 1 0.8113 0.2318
Gain 0.029

Windy vs Play Golf

Windy Yes No Entropy W.E(Play, Windy)
FALSE 6 2 0.8113 0.4636
TRUE 3 3 1.0000 0.4286
Gain 0.048

\[ \text{Gain}(T, X) = \text{Entropy}(T) - W.\text{Entropy}(T, X) \]

\[ G(\text{Play Golf}, \text{Outlook}) = \text{Entropy}(\text{Play Golf}) - W.\text{Entropy}(\text{Play Golf}, \text{Outlook}) \]

\[ 0.94 - 0.693 = 0.247 \]

\[ IG(y, D) = H_D(y) - \sum_{i=1}^{n} \frac{|D_i|}{|D|} H_{D_i}(y) \]
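The gains in the tables above can be reproduced directly from the (Yes, No) counts per attribute value. Here is a minimal sketch in plain Python (helper names are illustrative; the counts are taken from the tables above).

```python
# Minimal sketch: reproduce the information-gain values from the class counts.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, splits):
    """IG = H(parent) - sum over children of (|D_i| / |D|) * H(D_i)."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in splits.values())
    return entropy(parent_counts) - weighted

parent = [9, 5]  # 9 Yes, 5 No
splits = {
    "Outlook":  {"Sunny": [3, 2], "Overcast": [4, 0], "Rainy": [2, 3]},
    "Humidity": {"High": [3, 4], "Normal": [6, 1]},
    "Temp":     {"Hot": [2, 2], "Mild": [4, 2], "Cool": [3, 1]},
    "Windy":    {"False": [6, 2], "True": [3, 3]},
}
for attr, split in splits.items():
    print(attr, round(information_gain(parent, split), 3))
# Outlook 0.247, Humidity 0.152, Temp 0.029, Windy 0.048
```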

Build a Decision Tree using Information Gain with an Example

Step 1: Compute Entropy for the Target Variable (Play Golf?)

Entropy formula:

H(S) = - ∑ p(i) log₂ (p(i))

From the dataset:

  • Total samples = 14
  • Yes (Play Golf) = 9
  • No (Play Golf) = 5

H(S) = - (9/14 log₂ (9/14) + 5/14 log₂ (5/14))

H(S) = - (0.643 × (-0.637) + 0.357 × (-1.485)) = 0.940

Step 2: Compute Information Gain for Each Attribute

(a) Information Gain for "Outlook"

Splitting based on Outlook:

  • Sunny: (3 Yes, 2 No) → Entropy = 0.970
  • Overcast: (4 Yes, 0 No) → Entropy = 0.000 (pure)
  • Rainy: (2 Yes, 3 No) → Entropy = 0.970

H(S | Outlook) = (5/14 × 0.970) + (4/14 × 0.000) + (5/14 × 0.970) = 0.693

IG(Outlook) = H(S) - H(S | Outlook) = 0.940 - 0.693 = 0.247

(b) Information Gain for "Humidity"

  • High: (3 Yes, 4 No) → Entropy = 0.985
  • Normal: (6 Yes, 1 No) → Entropy = 0.592

H(S | Humidity) = (7/14 × 0.985) + (7/14 × 0.592) = 0.788

IG(Humidity) = H(S) - H(S | Humidity) = 0.940 - 0.788 = 0.152

(c) Information Gain for "Windy"

  • False: (6 Yes, 2 No) → Entropy = 0.811
  • True: (3 Yes, 3 No) → Entropy = 1.000

H(S | Windy) = (8/14 × 0.811) + (6/14 × 1.000) = 0.892

IG(Windy) = H(S) - H(S | Windy) = 0.940 - 0.892 = 0.048

Step 3: Choose the Best Attribute

  • Outlook: 0.247 (Highest Information Gain)
  • Humidity: 0.152
  • Windy: 0.048

Since Outlook has the highest IG, it becomes the root node.

Step 4: Split and Recursively Repeat

Now, we divide the dataset based on Outlook:

(a) Subtree for "Overcast"

All are Yes, so it's a leaf node.

(b) Subtree for "Rainy"

Choices:

  • Humidity: 0.971
  • Windy: 0.020

Since Humidity has the highest IG, we split on Humidity.

(c) Subtree for "Sunny"

Choices:

  • Windy: 0.971
  • Humidity: 0.020

Since Windy has the highest IG, we split on Windy (see the quick check below).
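A quick numeric check of the subtree gains quoted above, computed from the class counts inside the Rainy (2 Yes, 3 No) and Sunny (3 Yes, 2 No) subsets (plain Python, base-2 logarithm; helper names are illustrative):

```python
# Minimal sketch: information gain within the Rainy and Sunny subsets.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

# Rainy subset (2 Yes, 3 No)
print(gain([2, 3], [[0, 3], [2, 0]]))   # Humidity: High=(0Y,3N), Normal=(2Y,0N) -> 0.971
print(gain([2, 3], [[1, 2], [1, 1]]))   # Windy:    False=(1Y,2N), True=(1Y,1N)  -> 0.020
# Sunny subset (3 Yes, 2 No)
print(gain([3, 2], [[3, 0], [0, 2]]))   # Windy:    False=(3Y,0N), True=(0Y,2N)  -> 0.971
print(gain([3, 2], [[1, 1], [2, 1]]))   # Humidity: High=(1Y,1N), Normal=(2Y,1N) -> 0.020
```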

Summary

  • Calculated Entropy for the target variable.
  • Computed Information Gain for each feature.
  • Chose the best attribute (Outlook) as the root node.
  • Repeated the process recursively for sub-nodes.

R₁: IF (Outlook=Sunny) AND (Windy=FALSE) THEN Play=Yes

R₂: IF (Outlook=Sunny) AND (Windy=TRUE) THEN Play=No

R₃: IF (Outlook=Overcast) THEN Play=Yes

R₄: IF (Outlook=Rainy) AND (Humidity=High) THEN Play=No

R₅: IF (Outlook=Rainy) AND (Humidity=Normal) THEN Play=Yes

                         Outlook
                        /   |    \
                  Sunny  Overcast  Rainy
                    |        |        |
                  Windy   Play=Yes  Humidity
                 /     \            /      \
              False    True      High     Normal
            Play=Yes  Play=No   Play=No  Play=Yes
                                    
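The whole construction can also be reproduced programmatically. Below is a minimal ID3-style sketch in plain Python (standard library only; function and variable names are illustrative) that recursively picks the attribute with the highest information gain and prints a nested-dictionary version of the tree above.

```python
# Minimal ID3-style sketch: recursively split on the attribute with the
# highest information gain, exactly as in Steps 1-4 above.
from math import log2
from collections import Counter

# Play Golf rows from the dataset above: (Outlook, Temp, Humidity, Windy, Play)
data = [
    ("Rainy", "Hot", "High", "False", "No"),
    ("Rainy", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Sunny", "Mild", "High", "False", "Yes"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Sunny", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Rainy", "Mild", "High", "False", "No"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "High", "True", "No"),
]
FEATURES = ["Outlook", "Temp", "Humidity", "Windy"]

def entropy(rows):
    """Entropy of the class label (last column) over a list of rows."""
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def info_gain(rows, feat):
    """Information gain of splitting `rows` on feature `feat`."""
    i = FEATURES.index(feat)
    groups = {}
    for r in rows:
        groups.setdefault(r[i], []).append(r)
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(rows) - weighted

def build(rows, feats):
    """Recursively build the tree as nested dicts."""
    labels = {r[-1] for r in rows}
    if len(labels) == 1:                       # pure node -> leaf
        return labels.pop()
    if not feats:                              # no features left -> majority class
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    best = max(feats, key=lambda f: info_gain(rows, f))
    i = FEATURES.index(best)
    children = {}
    for value in sorted({r[i] for r in rows}):
        subset = [r for r in rows if r[i] == value]
        children[value] = build(subset, [f for f in feats if f != best])
    return {best: children}

print(build(data, FEATURES))
# -> {'Outlook': {'Overcast': 'Yes',
#                 'Rainy': {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
#                 'Sunny': {'Windy': {'False': 'Yes', 'True': 'No'}}}}
#    (output reformatted for readability; it matches rules R1-R5)
```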

Explain Gini Impurity

Gini Impurity

Gini Impurity is a metric that measures how often a randomly chosen element from a dataset would be incorrectly classified if it were labeled randomly according to the class distribution in that dataset.

It is used in decision trees, especially in algorithms like **CART (Classification and Regression Trees)**, to decide the best split at each node.

Formula:

\[ Gini(D) = 1 - \sum_{i=1}^{c} P_i^2 \]

Where:

  • \(P_i\) is the probability of class \(i\) in the dataset.
  • \(c\) is the total number of classes.

Example:

  • In the Play Golf dataset, we have **Yes (9)** and **No (5)** instances.

\[ Gini = 1 - \left( \left(\frac{9}{14}\right)^2 + \left(\frac{5}{14}\right)^2 \right) \]

\[ Gini = 1 - \left(0.413 + 0.128\right) = 0.459 \]

The lower the Gini Impurity, the purer the dataset.

Outlook vs Play Golf

Outlook Yes No Gini W.G(Play, Outlook)
Sunny 3 2 0.48 0.171
Overcast 4 0 0.00 0.000
Rainy 2 3 0.48 0.171
Gini Gain 0.116

Humidity vs Play Golf

Humidity Yes No Gini W.G(Play, Humidity)
High 3 4 0.49 0.245
Normal 6 1 0.24 0.122
Gini Gain 0.092

Temperature vs Play Golf

Temp Yes No Gini W.G(Play, Temp)
Hot 2 2 0.50 0.143
Mild 4 2 0.44 0.190
Cool 3 1 0.38 0.107
Gini Gain 0.019

Windy vs Play Golf

Windy Yes No Gini W.G(Play, Windy)
FALSE 6 2 0.38 0.214
TRUE 3 3 0.50 0.214
Gini Gain 0.031

\[ Gini\ Gain(T, X) = Gini(T) - W.Gini(T, X) \]

\[ Gini(\text{Play Golf}, \text{Outlook}) = Gini(\text{Play Golf}) - W.Gini(\text{Play Golf}, \text{Outlook}) \]

\[ 0.459 - 0.343 = 0.116 \]
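These Gini values can be reproduced from the per-value class counts. Here is a minimal sketch in plain Python (helper names are illustrative; the counts come from the tables above).

```python
# Minimal sketch: reproduce the Gini impurity and Gini-gain values above.
def gini(counts):
    """Gini(D) = 1 - sum_i p_i^2, computed from a list of class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_gain(parent_counts, splits):
    """Gini gain = Gini(parent) - weighted Gini of the child nodes."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * gini(c) for c in splits.values())
    return gini(parent_counts) - weighted

parent = [9, 5]  # 9 Yes, 5 No
splits = {
    "Outlook":  {"Sunny": [3, 2], "Overcast": [4, 0], "Rainy": [2, 3]},
    "Humidity": {"High": [3, 4], "Normal": [6, 1]},
    "Temp":     {"Hot": [2, 2], "Mild": [4, 2], "Cool": [3, 1]},
    "Windy":    {"False": [6, 2], "True": [3, 3]},
}
print(round(gini(parent), 3))                      # 0.459
for attr, split in splits.items():
    print(attr, round(gini_gain(parent, split), 3))
# Outlook 0.116, Humidity 0.092, Temp 0.019, Windy 0.031
```

As with information gain, Outlook gives the largest Gini gain, so CART would also pick it as the root split.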

Summary

  • Gini Impurity measures how mixed the classes are in a dataset.
  • The lower the Gini value, the purer the dataset.
  • Decision trees using CART split on the attribute with the **lowest** weighted Gini Impurity (equivalently, the highest Gini gain).

Overfitting vs. Underfitting in Decision Trees

Overfitting

  • Definition: The model learns the training data too well, capturing noise and irrelevant patterns.
  • Causes: Very deep trees, lack of pruning, too many features.
  • Effects: High training accuracy but poor test accuracy.
  • Solutions: Pruning, limiting tree depth, using ensemble methods like Random Forest.

Underfitting

  • Definition: The model is too simple and fails to capture patterns in the data.
  • Causes: Shallow trees, too much pruning, not using enough features.
  • Effects: Poor performance on both training and test data.
  • Solutions: Increase tree depth, use more features, reduce pruning.

Comparison Table

Criteria Overfitting Underfitting
Model Complexity Too complex Too simple
Training Accuracy High Low
Test Accuracy Low (poor generalization) Low (poor learning)
Solution Pruning, limiting depth Increase depth, reduce pruning
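As a rough illustration of this trade-off, here is a minimal sketch, assuming scikit-learn is installed; it compares an unrestricted tree against depth-limited and cost-complexity-pruned trees on a synthetic dataset, so the exact scores are not meaningful, only the train/test pattern.

```python
# Minimal sketch: controlling over/underfitting with depth limits and pruning.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = [
    ("deep, unrestricted (overfit-prone)", DecisionTreeClassifier(random_state=0)),
    ("max_depth=1 (underfit-prone)",       DecisionTreeClassifier(max_depth=1, random_state=0)),
    ("max_depth=4 (depth-limited)",        DecisionTreeClassifier(max_depth=4, random_state=0)),
    ("ccp_alpha=0.01 (pruned)",            DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)),
]
for name, model in candidates:
    model.fit(X_tr, y_tr)
    print(f"{name}: train={model.score(X_tr, y_tr):.2f}, test={model.score(X_te, y_te):.2f}")
```

A large gap between training and test accuracy signals overfitting; low accuracy on both signals underfitting.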